If we want to study climate change, we can find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA's Goddard Institute for Space Studies. The tabular data of temperature anomalies can be found here.
To define temperature anomalies, you need a reference, or base, period, which NASA clearly states is 1951-1980.
weather <-
read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv",
skip = 1,
na = "***")tidyweather <- weather %>%
select(-c("J-D", "D-N", "DJF", "MAM", "JJA", "SON")) %>% # select relevant columns
pivot_longer(cols = 2:13, names_to = "Month", values_to = "delta") # make data frame tidy

Let us plot the data using a time-series scatter plot, and add a trendline. To do that, we first need to create a new variable called date in order to ensure that the delta values are plotted chronologically.
In the following chunk of code, I used the eval=FALSE argument, which does not run a chunk of code; I did so that you can knit the document before tidying the data and creating a new dataframe tidyweather. When you actually want to run this code and knit your document, you must delete eval=FALSE, not just here but in all chunks where eval=FALSE appears.
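For reference, a knitr chunk that is displayed but not executed has a header of the form below (the chunk label tidy_weather is just an illustrative placeholder, not the name used in this document):

```{r tidy_weather, eval=FALSE}
# code in this chunk is shown in the knitted document but not run
```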
tidyweather <- tidyweather %>%
mutate(date = ymd(paste(as.character(Year), Month, "1")),
month = month(date), #had to get rid of label = true. TA advised could be a package loading problem
year = year(date))
ggplot(tidyweather, aes(x=date, y = delta))+
geom_point()+
geom_smooth(color="red") +
theme_bw() +
labs (
title = "Weather Anomalies drastically increasing over time"
)

Is the effect of increasing temperature more pronounced in some months? Use facet_wrap() to produce a separate scatter plot for each month, again with a smoothing line. Your chart should have human-readable labels; that is, each month should be labeled "Jan", "Feb", "Mar" (full or abbreviated month names are fine), not 1, 2, 3.
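One possible sketch of such a plot, assuming the tidyweather dataframe with the date and month columns created above (month.abb is used to get readable, correctly ordered month labels):

tidyweather %>%
  mutate(month_name = factor(month.abb[month], levels = month.abb)) %>% # readable, ordered month labels
  ggplot(aes(x = date, y = delta)) +
  geom_point(alpha = 0.3) +
  geom_smooth(color = "red") +
  facet_wrap(~ month_name) +
  theme_bw() +
  labs(title = "Temperature anomalies by month",
       x = NULL,
       y = "Delta")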
It is sometimes useful to group data into different time periods to study historical data. For example, we often refer to decades such as the 1970s, 1980s, and 1990s to refer to a period of time. NASA calculates a temperature anomaly as the difference from the base period of 1951-1980. The code below creates a new data frame called comparison that groups data in five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.
We remove data before 1881 using filter(). Then, we use the mutate function to create a new variable interval which contains information on which period each observation belongs to. We can assign the different periods using case_when().
comparison <- tidyweather %>%
filter(Year>= 1881) %>% #remove years prior to 1881
#create new variable 'interval', and assign values based on criteria below:
mutate(interval = case_when(
Year %in% c(1881:1920) ~ "1881-1920",
Year %in% c(1921:1950) ~ "1921-1950",
Year %in% c(1951:1980) ~ "1951-1980",
Year %in% c(1981:2010) ~ "1981-2010",
TRUE ~ "2011-present"
))

Inspect the comparison dataframe by clicking on it in the Environment pane.
Now that we have the interval variable, we can create a density plot to study the distribution of monthly deviations (delta), grouped by the different time periods we are interested in. Set fill to interval to group and colour the data by different time periods.
ggplot(comparison, aes(x=delta, fill=interval))+
geom_density(alpha=0.2) + #density plot with transparency set to 20%
theme_bw() + #theme
labs (
title = "Density Plot for Monthly Temperature Anomalies",
y = "Density" #changing y-axis label
)

So far, we have been working with monthly anomalies. However, we might be interested in average annual anomalies. We can do this by using group_by() and summarise(), followed by a scatter plot to display the result.
#creating yearly averages
average_annual_anomaly <- tidyweather %>%
group_by(Year) %>% #grouping data by Year
# creating summaries for mean delta
# use `na.rm=TRUE` to eliminate NA (not available) values
summarise(annual_average_delta = mean(delta, na.rm=TRUE))
#plotting the data:
ggplot(average_annual_anomaly, aes(x=Year, y= annual_average_delta))+
geom_point()+
#Fit the best fit line, using LOESS method
geom_smooth() +
#change to theme_bw() to have white background + black frame around plot
theme_bw() +
labs (
title = "Average Yearly Anomaly",
y = "Average Annual Delta"
)

NASA points out on their website that
A one-degree global change is significant because it takes a vast amount of heat to warm all the oceans, atmosphere, and land by that much. In the past, a one- to two-degree drop was all it took to plunge the Earth into the Little Ice Age.
Your task is to construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present.
formula_ci <- comparison %>%
# clean NAs and choose the interval 2011-present
drop_na(delta) %>%
filter(Year >= 2011) %>%
# calculate yearly mean temperature deviation (delta)
group_by(Year) %>%
summarise(year_mean_delta = mean(delta)) %>%
# Confidence Interval (CI) using the formula mean +- MoE
summarise(mean_delta = mean(year_mean_delta), # calculate summary statistics for yearly mean temperature deviation (delta)
sd_delta = sd(year_mean_delta),
count = n(),
t_critical = qt(0.975, count-1), # get t-critical value with (n-1) degrees of freedom
se_delta = sd(year_mean_delta)/sqrt(count), # calculate mean, SD, count, SE, lower/upper 95% CI
margin_of_error = t_critical * se_delta,
delta_low = mean_delta - margin_of_error,
delta_high = mean_delta + margin_of_error)
formula_ci

## # A tibble: 1 x 8
## mean_delta sd_delta count t_critical se_delta margin_of_error delta_low
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.973 0.204 9 2.31 0.0680 0.157 0.816
## # … with 1 more variable: delta_high <dbl>
set.seed(1234)
boot_yearly_mean_delta <- comparison %>%
# Get rid of NAs in 2019
drop_na(delta) %>%
# Choose only 2011 and following
filter(Year >= 2011) %>%
# Create yearly mean deltas
group_by(Year) %>%
summarise(year_mean_delta = mean(delta)) %>%
# Specify the variable of interest
specify(response = year_mean_delta) %>%
# Generate a bunch of bootstrap samples
generate(reps = 1000, type = "bootstrap") %>%
# Find the mean of each sample
calculate(stat = "mean")
# Calculate bootstrap method confidence intervals
percentile_ci <- boot_yearly_mean_delta %>%
get_confidence_interval(level = 0.95, type = "percentile")
percentile_ci

## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.855 1.10
What is the data showing us? Please type your answer after (and outside!) this blockquote. You have to explain what you have done, and the interpretation of the result. One paragraph max, please!
In the sections above, the confidence interval has been calculated using two different methods: the CI formula and bootstrapping. In the formula method, each year's annual delta is calculated by taking the arithmetic mean of all monthly deltas in that year, which yields a total of 9 observations from 2011 to 2019. Then, the standard deviation of the data and the standard error of the mean are calculated and combined with the appropriate t critical value to obtain a 95% confidence interval for the average annual delta. Meanwhile, in the bootstrap method, repeated samples are drawn from the data to provide an estimate of the average annual delta. Using the first method, we are 95% confident that the true average is between 0.816 and 1.13. Using the second method, the confidence interval is narrower: 0.855 to 1.10 (see the first plot below). Here, the bootstrap distribution is quite close to a normal distribution, as the second plot below demonstrates, so the two confidence intervals should be broadly similar.
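As a quick sanity check of the formula-based interval, we can plug the summary statistics reported above (mean 0.973, SD 0.204, n = 9) straight into mean +/- t * SD/sqrt(n):

# reproduce the formula CI from the printed summary statistics
0.973 + c(-1, 1) * qt(0.975, df = 9 - 1) * 0.204 / sqrt(9)
# returns approximately 0.816 and 1.130, matching delta_low and delta_high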
# Visualise bootstrap CI vs formula CI
visualize(boot_yearly_mean_delta) +
shade_ci(endpoints = percentile_ci,fill = "yellow")+
labs(title='Bootstrap CI for Yearly Mean Temperature Deviation narrower than Formula CI',
subtitle = 'Formula CI shown with dashed red lines, Bootstrap CI shaded in yellow')+
geom_vline(xintercept = formula_ci$delta_low, colour = "red", linetype="dashed", size=1.2)+
geom_vline(xintercept = formula_ci$delta_high, colour = "red", linetype="dashed", size=1.2)+
theme_bw()+
NULL

# compare bootstrap distribution with a Normal distribution with parameters estimated from the sample
ggplot(boot_yearly_mean_delta, aes(x = stat)) +
geom_density(color="blue") +
stat_function(
fun = dnorm,
color = "red",
size = 2,
args = list(mean = formula_ci$mean_delta, sd = formula_ci$se_delta)
)+
theme_bw()+
labs(title = "The Bootstrap distribution is close to a Normal distribution",
x = 'Bootstrap mean of yearly temperature deviation (delta)', y = "") +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())+
NULL

As we saw in class, fivethirtyeight.com has detailed data on all polls that track the president's approval.
# Import approval polls data
approval_polllist <- read_csv(here::here('data', 'approval_polllist.csv'))
# or directly off fivethirtyeight website
# approval_polllist <- read_csv('https://projects.fivethirtyeight.com/trump-approval-data/approval_polllist.csv')
glimpse(approval_polllist)

## Rows: 15,619
## Columns: 22
## $ president <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup <chr> "All polls", "All polls", "All polls", "All polls…
## $ modeldate <chr> "9/27/2020", "9/27/2020", "9/27/2020", "9/27/2020…
## $ startdate <chr> "1/20/2017", "1/20/2017", "1/20/2017", "1/21/2017…
## $ enddate <chr> "1/22/2017", "1/22/2017", "1/24/2017", "1/23/2017…
## $ pollster <chr> "Gallup", "Morning Consult", "Ipsos", "Gallup", "…
## $ grade <chr> "B", "B/C", "B-", "B", "B-", "C+", "B+", "B", "C+…
## $ samplesize <dbl> 1500, 1992, 1632, 1500, 1651, 1500, 1190, 1500, 1…
## $ population <chr> "a", "rv", "a", "a", "a", "lv", "rv", "a", "lv", …
## $ weight <dbl> 0.262, 0.680, 0.153, 0.243, 0.142, 0.200, 1.514, …
## $ influence <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ approve <dbl> 45.0, 46.0, 42.1, 45.0, 42.3, 57.0, 36.0, 46.0, 5…
## $ disapprove <dbl> 45.0, 37.0, 45.2, 46.0, 45.8, 43.0, 44.0, 45.0, 4…
## $ adjusted_approve <dbl> 45.7, 45.3, 43.2, 45.7, 43.4, 51.5, 37.6, 46.7, 5…
## $ adjusted_disapprove <dbl> 43.6, 38.3, 43.9, 44.6, 44.5, 44.5, 42.8, 43.6, 4…
## $ multiversions <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tracking <lgl> TRUE, NA, TRUE, TRUE, TRUE, TRUE, NA, TRUE, TRUE,…
## $ url <chr> "http://www.gallup.com/poll/201617/gallup-daily-t…
## $ poll_id <dbl> 49253, 49249, 49426, 49262, 49425, 49266, 49260, …
## $ question_id <dbl> 77265, 77261, 77599, 77274, 77598, 77278, 77272, …
## $ createddate <chr> "1/23/2017", "1/23/2017", "3/1/2017", "1/24/2017"…
## $ timestamp <chr> "00:45:20 27 Sep 2020", "00:45:20 27 Sep 2020", "…
# Use `lubridate` to fix dates, as they are given as characters.

We will calculate the average net approval rate (approve - disapprove) for each week since he got into office, plotting the net approval along with its 95% confidence interval. There are various dates given for each poll; we will use enddate, i.e., the date the poll ended.
We are faceting by year, and adding an orange line at zero.
# Clean data set and prepare to analyse on weekly basis and across years
approval_polllist_clean <- approval_polllist %>%
mutate(enddate = mdy(enddate)) %>% # Use `lubridate` to fix dates, as they are given as characters.
mutate(week_count = week(enddate)) %>% # get week counts
mutate(year = year(enddate)) %>% # get year for facetting
filter(subgroup == "Voters") # Keeping only "Voters" as the subgroup
# Adding net rate to the data set as %
approval_polllist_clean <- approval_polllist_clean %>%
mutate(net_rate = (adjusted_approve - adjusted_disapprove) / (adjusted_approve + adjusted_disapprove) * 100)
#Calculating average net rate, SD, SE & DF to create CIs on a weekly basis
weekly_approval_polllist <- approval_polllist_clean %>%
group_by(year,week_count) %>%
summarise(count =n(),
MEAN = mean(net_rate),
SD = sd(net_rate),
SE = SD/sqrt(count),
DF = count - 1) %>%
mutate(CI_upper = MEAN + qt(.975, DF) * SE,
CI_lower = MEAN - qt(.975, DF) * SE)
# colour hex codes from: https://html-color-codes.info/colors-from-image/
graph_colouring <- c("#F8766D" ,"#7CAE00", "#00BFC4", "#C77CFF")
#Plotting the data
approval_plot <- ggplot(weekly_approval_polllist, aes(x = week_count, y = MEAN, color = as.factor(year))) +
geom_line() +
facet_wrap(~year) +
geom_point(size = 1) +
geom_hline(yintercept = 0, color = "orange") +
scale_x_continuous (limits = c(0, 53),
breaks = c(0, 13, 26, 39, 52),
labels = c("0", "13","26","39","52")) +
scale_y_continuous (limits=c(-22, 9),
breaks=c(-20, -17.5, -15, -12.5, -10, -7.5, -5, -2.5, 0, 2.5, 5, 7.5)) +
geom_ribbon(aes(ymin = CI_lower, ymax = CI_upper, fill = as.factor(year)), alpha = .1) +
labs(y = "Average Net Approval (%)",
x = "Week of the year") +
ggtitle(label = "Estimating Net Approval (approve - dissaprove) for Donald Trump",
subtitle = "Weekly average of all polls") +
theme(title = element_text(size=8),
axis.title = element_text(size=8),
axis.text = element_text(size=8),
axis.ticks = element_blank(),
strip.text = element_text(size=8),
panel.background = element_rect(color="black", fill = "white"),
panel.border = element_blank(),
strip.background = element_rect(color="black", fill="grey", size=.25),
panel.grid = element_line(color = "#CCCCCC"),
legend.position = "none") +
scale_colour_manual(aesthetics = "custom_color_palette") +
#coord_fixed(ratio= 3/5, clip = "on") +
theme(plot.margin = unit(c(1,1,0,1), "cm"))
ggsave("plot1.png",
plot = last_plot(),
scale = 1,
width = 25,
height = 12.5,
units = "cm",
dpi = 300,
limitsize = TRUE)
knitr::include_graphics(here::here("plot1.png"), error = FALSE)What’s going on with the confidence intervals between week 15 (6-12 April 2020) and week 34 (17-23 August 2020)?
The 95% confidence interval for week 15 (-9.56, -6.45) is relatively narrower than the interval for week 34 (-13.69, -9.68), implying a tighter cluster of polling data near the mean. This translates to a higher proportion of people sticking to their approval rating in week 15: the standard deviation of the polls is lower. This compares to week 34, where the mean approval ratings were generally lower but on a more volatile upward trend. The volatility might be due to the fact that week 34 contained multiple polarising events. Both Obamas publicly criticised Trump, Fox News praised Joe Biden's latest speech, and Steve Bannon was indicted. Such events will have angered many undecided voters, causing their disapproval to rise, but will potentially also have been received as unfair by hard-core Trump supporters, ultimately increasing polling volatility. Finally, we can also see that the confidence intervals for the two weeks don't overlap, so we can say with 95% confidence that there has been a shift in the population's approval rating.
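The figures quoted above can be checked directly against the weekly summary table built earlier (a quick look, assuming the weekly_approval_polllist dataframe from the code above):

# inspect the summary statistics behind weeks 15 and 34 of 2020
weekly_approval_polllist %>%
  filter(year == 2020, week_count %in% c(15, 34)) %>%
  select(year, week_count, count, MEAN, SD, CI_lower, CI_upper)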
Recall the gapminder data frame from the gapminder package. That data frame contains just six columns from the larger data in Gapminder World. In this part, we will join a few dataframes with more data than the gapminder package. Specifically, we will look at data on HIV prevalence, life expectancy, fertility rates, GDP per capita, primary school enrollment, and under-5 mortality.
We will use the wbstats package to download data from the World Bank. The relevant World Bank indicators are SP.DYN.TFRT.IN, SE.PRM.NENR, NY.GDP.PCAP.KD, and SH.DYN.MORT.
# load gapminder HIV data
hiv <- read_csv(here::here("data","adults_with_hiv_percent_age_15_49.csv"))
life_expectancy <- read_csv(here::here("data","life_expectancy_years.csv"))
# get World bank data using wbstats
indicators <- c("SP.DYN.TFRT.IN","SE.PRM.NENR", "SH.DYN.MORT", "NY.GDP.PCAP.KD")
library(wbstats)
worldbank_data <- wb_data(country="countries_only", #countries only- no aggregates like Latin America, Europe, etc.
indicator = indicators,
start_date = 1960,
end_date = 2016)
# get a dataframe of information regarding countries, indicators, sources, regions, indicator topics, lending types, income levels, from the World Bank API
countries <- wbstats::wb_cachelist$countries

We will join the three dataframes (life_expectancy, worldbank_data, and hiv) into one. You may need to tidy your data first and then perform join operations. Think about what type of join makes the most sense and explain why you chose it.
# clean to longer tidy data
hiv_cleaned <- hiv %>%
select(country, c(13:34)) %>%
pivot_longer(cols = 2:23, names_to = "Year", values_to = "HIV_rate_per_100") %>%
mutate(Year = as.numeric(Year)) # create numeric year value
# clean to longer tidy data
life_expectancy_cleaned <- life_expectancy %>%
select(country, c(192:213)) %>%
pivot_longer(cols = 2:23, names_to = "Year", values_to = "life_expectancy") %>%
mutate(Year = as.numeric(Year)) # create numeric year value
worldbank_cleaned <- worldbank_data %>%
# choose only years for which all to be merged data frames have data
filter(date >= 1990,
date <= 2011) %>%
rename(Year = date) %>%
select(c(3:8)) %>%
rename(gdpPerCap = NY.GDP.PCAP.KD,
school_enrollment_rate = SE.PRM.NENR,
mortality_rate_under5 = SH.DYN.MORT,
fertility_rate = SP.DYN.TFRT.IN)
# Merge all data sets
compiled_countries_data <- worldbank_cleaned %>%
left_join(hiv_cleaned, by = c("country", "Year")) %>%
left_join(life_expectancy_cleaned, by = c("country", "Year")) %>%
left_join(countries, by = "country") %>%
select(region, country, Year, gdpPerCap, school_enrollment_rate, mortality_rate_under5, fertility_rate, HIV_rate_per_100, life_expectancy)

hiv_life <- compiled_countries_data %>%
group_by(Year, region) %>%
drop_na(HIV_rate_per_100, life_expectancy) %>%
summarise(correlation = cor(HIV_rate_per_100, life_expectancy)) %>%
filter(!region %in% c("North America", "Middle East & North Africa")) # Graph will be faceted by region, so these two regions are excluded because there are not enough observations.
# Plot correlation coefficient - time graph instead of HIV - life expectancy graph. This allows further analysis of how the relationship change over time.
ggplot(hiv_life, aes(x = Year, y = correlation)) +
geom_point() +
facet_wrap(~region, scales = "free_x") + # faceting by region allows data to be grouped by countries with more similar socio-economic conditions.
geom_smooth() +
theme_bw() +
labs(title = "Correlation of HIV Prevalence and Life Expectancy over Time",
subtitle = "North America, Middle East & North Africa excluded due to insufficient samples",
y = "Correlation Coefficient",
x = "Year") In most regions, there has been a negative correlation between HIV prevalence and life expectancy over time. This suggests that the higher the HIV rate, the lower the life expectancy. However, this seems to be the case only for less developed regions. In Europe & Central Asia where living conditions are significantly higher, the correlation coefficient between 1990-2000 is actually positive. Since this is counter-intuitive, the only reasonable explanation would be that as living conditions increase, HIV rate becomes much less relevant in deciding life expectancy. This also explains why Sub-Saharan Africa and South Asia, where economic development was high between 2005-2011, experienced significant decreases in the absolute value of correlation coefficients.
compiled_countries_data %>%
filter(region != "North America") %>%
ggplot(aes(x = gdpPerCap, y = fertility_rate)) +
geom_point() +
facet_wrap(~region, scales = "free") +
theme_light() +
geom_smooth() +
labs(title = "Relationship between fertility rate and income varies by region",
subtitle = "North America excluded due to insufficient samples",
x = "GDP Per Capita",
y = "Fertility rate (Births per woman)") At lower income levels, there is a consistent trend that the higher the average income, the lower the fertility rate (negative correlation). Conversely, at mid-income levels, birth rates actually increase with GDP per capita for most countries (positive correlation), although the degree by which it increases varies greatly by region. Finally, at higher income levels, the correlation becomes negative again.
HIV_missing_data <- compiled_countries_data %>%
group_by(region) %>%
summarise(missing_entries = sum(is.na(HIV_rate_per_100)))
# Compare by number of missing entries
ggplot(HIV_missing_data, aes(x = missing_entries, y = reorder(region, missing_entries))) +
geom_col() +
theme_economist() +
labs(title = "Missing HIV data",
subtitle = "By count of missing observations",
y = "",
x = "Number of missing entries") +
NULL

HIV_missing_data_2 <- compiled_countries_data %>%
group_by(region) %>%
summarise(missing_entries = sum(is.na(HIV_rate_per_100)),
count = n()) %>%
mutate(percentage_missing = missing_entries/count)
# Compare by percentage of missing entries
ggplot(HIV_missing_data_2, aes(x = percentage_missing, y = reorder(region, percentage_missing))) +
geom_col() +
theme_economist() +
scale_x_continuous(labels = scales::percent) +
labs(title = "Missing HIV data",
subtitle = "By percentage of missing observations",
y = "",
x = "Percentage") +
NULL

East Asia & Pacific has the most observations with missing HIV data, by both count of missing entries and percentage of missing entries.
We will find the top 5 countries that have seen the greatest improvement, as well as those 5 countries where mortality rates have had the least improvement or even deterioration.
mortality_plot <- compiled_countries_data %>%
group_by(region, Year) %>%
summarise(mean_mortality_rate = mean(mortality_rate_under5, na.rm = TRUE))
ggplot(mortality_plot, aes(x = Year, y = mean_mortality_rate)) +
geom_point() +
facet_wrap(~region, scales = "free_x") +
theme_bw() +
labs(title = "Mortality rate decreased over time across all regions",
subtitle = "most significantly in South Asia and Sub-Saharan Africa",
y = "Mortality rate of children under 5 (per 1,000 live births)",
x = "") +
NULL

# Collect data for best performing countries
mortality_head <- compiled_countries_data %>%
filter(Year %in% c(1990, 2011)) %>% # keep only the first and last year of the period
select(region, country, Year, mortality_rate_under5) %>%
pivot_wider(names_from = Year, values_from = mortality_rate_under5) %>%
rename(start = 3,
end = 4) %>%
mutate(change = end - start) %>%
drop_na(change) %>%
arrange(region, change) %>%
group_by(region) %>%
slice_head(n=5)
# Collect data for worst performing countries
mortality_tail <- compiled_countries_data %>%
filter(Year %in% c(1990, 2011)) %>%
select(region, country, Year, mortality_rate_under5) %>%
pivot_wider(names_from = Year, values_from = mortality_rate_under5) %>%
rename(start = 3,
end = 4) %>%
mutate(change = end - start) %>%
drop_na(change) %>%
arrange(region, change) %>%
group_by(region) %>%
slice_tail(n=5)
# Combine the 2 sets of data above
mortality_top_and_bottom <- bind_rows(mortality_head, mortality_tail) %>%
arrange(region, change) %>%
select(region, country, change) %>%
rename(change_in_mortality_rate = change)
mortality_top_and_bottom

## # A tibble: 64 x 3
## # Groups: region [7]
## region country change_in_mortality_rate
## <chr> <chr> <dbl>
## 1 East Asia & Pacific Timor-Leste -116.
## 2 East Asia & Pacific Lao PDR -88.2
## 3 East Asia & Pacific Mongolia -80.2
## 4 East Asia & Pacific Cambodia -75.3
## 5 East Asia & Pacific Myanmar -53.9
## 6 East Asia & Pacific New Zealand -5.20
## 7 East Asia & Pacific Singapore -4.9
## 8 East Asia & Pacific Australia -4.70
## 9 East Asia & Pacific Brunei Darussalam -3.5
## 10 East Asia & Pacific Japan -3.10
## # … with 54 more rows
Here, a negative change in mortality rate implies an improvement. Hence, within each region, the five countries that improved the most are those with the most negative changes in mortality rate, and vice versa for the least improved. The table covers the 1990-2011 period. The North American region consists of only two countries, the USA and Canada, so both countries appear in the top five for most and least improvement.
compiled_countries_data %>%
filter(region != "North America") %>%
ggplot(aes(x = fertility_rate, y = school_enrollment_rate)) +
geom_point() +
facet_wrap(~region, scales = "free") +
expand_limits(y = c(0,100)) +
theme_light() +
geom_smooth() +
labs(title = "Varied relationships between fertility rate and primary school enrollment",
subtitle = "North America excluded due to insufficient samples",
x = "Fertility rate (Births per woman)",
y = "Primary School Enrollment (%)") Whilst the relationship between fertility rate and primary school enrollment varies by region, on overall there seems to be a negative relationship between the 2 variables in less developed regions (Latin America & Caribbean, Middle East & North Africa, Sub-Saharan Africa, South Asia, and countries with really high birth rates in East Asia & Pacific). The correlation, however, fades in more developed countries (Europe & Central Asia, and most of East Asia & Pacific). As with the HIV prevalence - life expectancy case, the most plausible explanation is that as living conditions increase, fertility rates become much less relevant in deciding primary school enrollment.
Let us revisit the CDC Covid-19 Case Surveillance Data. There are well over 3 million entries of individual, de-identified patient data. Since this is a large file, I suggest you use vroom to load it and keep cache=TRUE in the chunk options.
# file contains 11 variables and 3.66m rows and is well over 380Mb.
# It will take time to download
# URL link to CDC to download data
url <- "https://data.cdc.gov/api/views/vbim-akqf/rows.csv?accessType=DOWNLOAD"
covid_data <- vroom::vroom(url)%>% # If vroom::vroom(url) doesn't work, use read_csv(url)
clean_names()

In the following, we show the death rate (%) broken down first by age group, sex, and presence of co-morbidities, and then by age group, sex, and whether the patient was admitted to an ICU.
covid_plot1_data <- covid_data %>%
#remove all observations where relevant data are missing
mutate_if(is.character, list(~na_if(., "Unknown"))) %>%
mutate_if(is.character, list(~na_if(., "Missing"))) %>%
mutate(sex = na_if(sex, "Other")) %>% # to ensure sex response "Other" is dropped with other NAs
drop_na(age_group, sex, medcond_yn, death_yn) %>% # drop NAs
group_by(age_group, sex, medcond_yn, death_yn) %>%
summarise(count = n()) %>%
ungroup() %>%
group_by(age_group, sex, medcond_yn) %>%
mutate(subgroup_total = sum(count)) %>%
ungroup() %>%
filter(death_yn == "Yes") %>%
mutate(death_percentage = round(count/subgroup_total, digits = 3),
medcond_yn = factor(medcond_yn, levels = c("Yes", "No"),
labels = c("With comorbidities", "Without comorbidities"))) # Rename facet labels
new <- theme_set(theme_bw())
theme_replace(plot.tag.position = c(0.95,0.01),
plot.tag = element_text(size = 9)) # Custom format tag position and font size for "Source: CDC"
covid_plot1 <- ggplot(covid_plot1_data, aes(x = death_percentage, y = reorder(age_group, age_group))) +
geom_col(fill = "#6666CC") +
facet_grid(medcond_yn ~ sex) +
coord_cartesian(xlim = c(0, 0.7), clip = "off") +
scale_x_continuous(labels = scales::percent) +
theme_set(new) +
labs(title = "Covid death % by age group, sex and presence of co-morbidities",
x = "",
y = "",
caption = "Souce: CDC") +
geom_text(aes(label = death_percentage*100), size = 3, hjust = -0.1) +
NULL
ggsave("covid_plot1.png", width = 33.87, height = 18.15, units = "cm", dpi = 96) # ggsave to resize graph
new <- theme_set(theme_bw())
theme_replace(plot.tag.position = c(0.95,0.01),
plot.tag = element_text(size = 9)) # Repeat custom format for 2nd graph
covid_plot2_data <- covid_data %>%
#remove all observations where relevant data are missing
mutate_if(is.character, list(~na_if(., "Unknown"))) %>%
mutate_if(is.character, list(~na_if(., "Missing"))) %>%
mutate(sex = na_if(sex, "Other")) %>% # to ensure sex response "Other" is dropped with other NAs
drop_na(age_group, sex, icu_yn, death_yn) %>% # drop NAs
group_by(age_group, sex, icu_yn, death_yn) %>%
summarise(count = n()) %>%
ungroup() %>%
group_by(age_group, sex, icu_yn) %>%
mutate(subgroup_total = sum(count)) %>%
ungroup() %>%
filter(death_yn == "Yes") %>%
mutate(death_percentage = round(count/subgroup_total, digits = 3),
icu_yn = factor(icu_yn, levels = c("Yes", "No"),
labels = c("Admitted to ICU", "No ICU"))) # Rename facet labels
covid_plot2 <- ggplot(covid_plot2_data, aes(x = death_percentage, y = reorder(age_group, age_group))) +
geom_col(fill = "#FF9999") +
facet_grid(icu_yn ~ sex) +
coord_cartesian(xlim = c(0, 0.85), clip = "off") +
scale_x_continuous(labels = scales::percent) +
theme_set(new) +
labs(title = "Covid death % by age group, sex and whether patient was admitted to ICU",
x = "",
y = "",
caption = "Souce: CDC") +
geom_text(aes(label = death_percentage*100), size = 3, hjust = -0.1) +
NULL
ggsave("covid_plot2.png", width = 33.87, height = 18.15, units = "cm", dpi = 96) # ggsave to resize graph
# display saved graphs
knitr::include_graphics(here::here("covid_plot1.png"), error = FALSE)knitr::include_graphics(here::here("covid_plot2.png"), error = FALSE)We will look at TfL data on how many bikes were hired every single day. We can get the latest data by running the following
url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"
# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))

## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2020-09-18T09%3A06%3A54/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20201004%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201004T192007Z&X-Amz-Expires=300&X-Amz-Signature=e5010ca348cffd3ce020fad043f1bea706a88ab7ac6ee28ad406448699ffc1ea&X-Amz-SignedHeaders=host]
## Date: 2020-10-04 19:20
## Status: 200
## Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
## Size: 165 kB
## <ON DISK> /var/folders/dw/7jxg1bl50110hhppf2gnfvkm0000gn/T//Rtmpb60ZYO/file121e1235697ad.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
sheet = "Data",
range = cell_cols("A:B"))
# change dates to get year, month, and week
bike <- bike0 %>%
clean_names() %>%
rename (bikes_hired = number_of_bicycle_hires) %>%
mutate (year = year(day),
month = lubridate::month(day, label = TRUE),
week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year.
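A sketch of one way to build that facet grid, assuming the bike dataframe created above (distributions of daily hires, with years in rows and months in columns):

bike %>%
  filter(year >= 2015) %>% # keep recent years so the grid stays readable
  ggplot(aes(x = bikes_hired)) +
  geom_density() +
  facet_grid(year ~ month) +
  theme_bw() +
  labs(title = "Distribution of daily TfL bike hires by month and year",
       x = "Bikes hired per day",
       y = "")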
We can clearly see that 2020 has not been a normal year, with especially low rental rates in May and June due to the Covid-19 lockdown.
We will investigate this further by looking into the percentage deviations from the long-term average by both month and week. The long-term average will be calculated by using the mean to ensure that all days in a given month are given the same weight.
# create monthly forecast based on historic averages
bike_exp <- bike %>%
filter(year %in% c(2015:2019)) %>% # exclude 2020 to avoid outliers
group_by(month) %>%
summarise(expected_rentals = mean(bikes_hired)) # use mean to calculate expected rentals
# create monthly actuals
bike_avg <- bike %>%
filter(year >= "2015") %>%
group_by(month, year) %>%
summarise(actual_rentals = mean(bikes_hired)) %>%
left_join(y = bike_exp, by = "month") %>% # merge forecasts with actuals
mutate(excess_rentals = actual_rentals - expected_rentals, # calculate excess bikes rental
date_month = as.Date(paste0("2015-", match(month, month.abb),"-01"),"%Y-%m-%d")) # use a date format for the graph; 2015 is an arbitrary placeholder year and won't show up
# Create graph
bike_month_surplus <- ggplot(bike_avg, aes(x = date_month, y = actual_rentals)) +
# create two line graphs
geom_line(aes(y = expected_rentals), color = "blue", size = 1) +
geom_line(aes(y = actual_rentals), color = "black", size = 0.25) +
# create ribbons for deviations. Deficits to forecast shall be red, surpluses will be green
geom_ribbon(aes(ymin = expected_rentals + pmin(excess_rentals,0), ymax = expected_rentals, fill = "Deficit"), alpha = .2) +
geom_ribbon(aes(ymin = expected_rentals, ymax = expected_rentals + pmax(excess_rentals,0), fill = "Surplus"), alpha = .2) +
scale_fill_manual(values=c("#CB454A","#7DCD85"), name="Deficit vs. Surplus") +
# facet wrap for year
facet_wrap( ~year) +
theme_bw(base_size = 15) +
scale_x_date(date_labels = "%b", date_breaks ="months") + # only show month on x-axis
# create custom theme
theme(title = element_text(size=8),
axis.text = element_text(size=8),
axis.ticks = element_blank(),
axis.title.x= element_blank(),
panel.background = element_rect(color="white", fill = "white"),
panel.border = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major = element_line(color="grey", size = 0.1),
strip.background = element_rect(color="white", fill="white"),
strip.text = element_text(size=8),
panel.grid = element_line(color = "#D3D3D3"),
legend.position = "none") +
labs(y = "Monthly bike rentals",
caption = "Source:TfL, London Data Store") +
ggtitle(label = "Monthly changes in TfL bike rentals",
subtitle = "Change in monthly average shown in blue \n and calculated between 2015-2019")
# Save graph as custom format
ggsave("TfL_month.png",
plot = last_plot(),
scale = 1,
width = 25,
height = 20,
units = "cm",
dpi = 300,
limitsize = TRUE)
knitr::include_graphics(here::here("TfL_month.png"), error = FALSE)# Calculate weekly forecasts as long term weekly average
bike_exp_week <- bike %>%
filter(year %in% c(2015:2019)) %>% # exclude 2020 to avoid outliers
group_by(week) %>%
summarise(expected_rentals = mean(bikes_hired))
# Create weekly actuals, merge forecast to actuals
bike_avg_week <- bike %>%
filter(year >= "2015") %>%
group_by(week, year) %>%
summarise(actual_rentals = mean(bikes_hired)) %>%
left_join(y = bike_exp_week, by = "week") %>%
# Calculate weekly deviation from long-term average
mutate(excess_rentals = actual_rentals - expected_rentals,
perc_change = excess_rentals / expected_rentals)
# Create graph
bike_week_deviation <- ggplot(bike_avg_week, aes(x = week, y = perc_change)) +
# create quarter shading
geom_rect(xmin=13, xmax=26, ymin=-1, ymax=Inf,fill = "grey",alpha=0.01) +
geom_rect(xmin=39, xmax=53, ymin=-1, ymax=Inf,fill = "grey",alpha=0.01) +
# create deviation and forecast (y=0) lines
geom_line(color = "black", size = 0.25) +
geom_line(aes(y = 0), color = "black", size = 0.25) +
# Create ribbons to highlight deficits and surpluses in red and green respectively
geom_ribbon(aes(ymin = pmin(perc_change,0), ymax = 0, fill = "Deficit"), alpha = .2) +
geom_ribbon(aes(ymin = 0, ymax = pmax(perc_change,0), fill = "Surplus"), alpha = .2) +
scale_fill_manual(values=c("#CB454A","#7DCD85"), name="Deficit vs. Surplus") +
# Create rug at bottom of the graphs
geom_rug(aes(colour = ifelse(actual_rentals >= expected_rentals,">=0","<0")), sides = "b")+
scale_colour_manual(values=c("#CB454A","#7DCD85"),name="Deficit vs. Surplus", guide=FALSE) +
# facet wrap by year
facet_wrap( ~year) +
theme_bw(base_size = 15) +
# custom x and y axis scales
scale_y_continuous(labels = scales::percent) +
scale_x_continuous (limits = c(0, 53),
breaks = c(0, 13, 26, 39, 53),
labels = c("", "13","26","39","53")) +
# Create custom theme
theme(title = element_text(size=8),
axis.text = element_text(size=8),
axis.ticks = element_blank(),
axis.title.y= element_blank(),
panel.background = element_rect(color="white", fill = "white"),
panel.border = element_blank(),
panel.grid.major = element_line(color="grey", size = 0.1),
strip.background = element_rect(color="white", fill="white"),
strip.text = element_text(size=8),
panel.grid = element_line(color = "#D3D3D3"),
legend.position = "none") +
labs(x = "week",
caption = "Source:TfL, London Data Store") +
ggtitle(label = "Weekly changes in TfL bike rentals",
subtitle = "% changes from weekly averages \n calculated between 2015 - 2019")
# Save graph as custom format
ggsave("TfL_week.png",
plot = last_plot(),
scale = 1,
width = 25,
height = 20,
units = "cm",
dpi = 300,
limitsize = TRUE)
knitr::include_graphics(here::here("TfL_week.png"), error = FALSE)As usual, there is a lot of explanatory text, comments, etc. You do not need these, so delete them and produce a stand-alone document that you could share with someone. Knit the edited and completed R Markdown file as an HTML document (use the “Knit” button at the top of the script editor window) and upload it to Canvas.
Please seek out help when you need it, and remember the 15-minute rule. You know enough R (and have enough examples of code from class and your readings) to be able to do this. If you get stuck, ask for help from others, post a question on Slack, and remember that I am here to help too!
As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?
Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.
Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).
Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.